by group4
HelocData is an anonymized dataset of Home Equity Line of Credit (HELOC) loans. This project predicts the risk flag ('RiskFlag'), which indicates whether the loan was ever 90 days delinquent over a two-year period.
Before reading the csv file, it is important to note that the special values
-7: Condition not Met (e.g., No Inquiries, No Delinquencies)
-8: No Usable/Valid Trades or Inquiries
-9: No Bureau Record or No Investigation
all represent missing information (see HelocDataDict, SpecialValues). For each feature, all of these special values will be treated as NaN.
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/choijaewon959/HELOC-RiskFlag-Prediction/master/HelocData.csv'
df = pd.read_csv(url, na_values = [-7, -8, -9])
df.head()
Map the categorical label to 1 ('Good') and 0 ('Bad').
df['RiskFlag'] = df['RiskFlag'].replace(['Good', 'Bad'], [1, 0])
df.head()
Now, let's read the explanation data dictionary.
from google.colab import files
uploaded = files.upload()
import io
data_dict = pd.read_excel(io.BytesIO(uploaded['HelocDataDict.xlsx']))
data_dict
Now, let's take a look at the missing value frequencies before the actual cleaning of the data.
# count the missing value frequency
df_miss_num = df.iloc[:, 1:].isnull().sum()
df_total_num = df.shape[0]
df_miss_freq = df_miss_num / df_total_num
print(df_miss_freq.apply(lambda x: format(x, '.2%')).to_string())
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.figure(figsize=(12, 6))
df_miss_freq.plot.bar(width=0.75, rot=0, ax=plt.gca())
plt.ylabel('Frequencies')
plt.title('Bar Plot of Missing Value Frequencies', fontsize=15)
plt.show()
It is important to note that over half of the values of x9 are missing; therefore, imputing them based only on the existing values may introduce bias. (For the same reason, x15 and x19 may require other imputation methods or should be dropped.)
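This drop rule can also be applied programmatically. A minimal sketch on a toy frame (the 50% threshold and the column names here are illustrative assumptions, not taken from HelocData):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: 'x9' is mostly missing, 'x1' is mostly observed.
toy = pd.DataFrame({
    'x1': [1.0, 2.0, np.nan, 4.0],
    'x9': [np.nan, np.nan, np.nan, 7.0],
})

threshold = 0.5                      # drop features with >50% missing
miss_freq = toy.isnull().mean()      # per-column missing-value frequency
to_drop = miss_freq[miss_freq > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
```

Here `to_drop` evaluates to `['x9']`, so only `x1` survives.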
For readability, let's use the variable names from the dictionary by merging the descriptions of the data dictionary with the raw data.
var_names = data_dict['Description'][1:].str.split(r'(\.| \()', expand=True).iloc[:, 0]
var_names
df.columns = np.concatenate((['RiskFlag'], var_names), axis=0)
df.head()
df_X = df.iloc[:,1:]
df_y = df.iloc[:,0]
df_X.head()
Below are summary statistics of the data.
data_stat = df.describe().T
data_stat
It is important to preprocess the data so that the feature extraction and feature engineering processes do not produce biased or misleading results. This part covers data cleaning and data imputation, with different criteria based on the nature of each feature.
As discussed earlier, x9 (Months Since Most Recent Delinquency), x15 (Months Since Most Recent Inq excl 7days), and x19 (Net Fraction Installment Burden) will be dropped for having too many missing values.
df = df.drop(columns=['Months Since Most Recent Delinquency', 'Months Since Most Recent Inq excl 7days', 'Net Fraction Installment Burden'])
df.head()
Now, let's take a look at the distributions of the features so that more reasonable imputation methods can be selected.
# histograms and density plots for all features, three per row
cols = 3
rows = (df.shape[1]-1)//cols + 1
fig = plt.figure(figsize=(50, 5*rows))
for i, feature in enumerate(df.columns[1:]):
    ax = fig.add_subplot(rows, cols, i+1)
    df[feature].hist(bins=15, density=True, color='green', alpha=0.4, rwidth=0.95, ax=ax)
    df[feature].plot(kind='density', color='green', ax=ax)
    # draw mean and median lines
    x1 = df[feature].mean(skipna=True)
    x2 = df[feature].median(skipna=True)
    plt.vlines(x=x1, ymin=0, ymax=0.025, color='r', linestyles=':', label='mean')
    plt.vlines(x2, 0, 0.025, 'b', '-.', label='median')
    # range of x
    x_min = data_stat.loc[feature]['min']
    x_max = data_stat.loc[feature]['max']
    ax.set_title(feature, fontsize=10)
    plt.legend()
    plt.xlim(x_min, x_max)
plt.show()
There are some variables that are noticeably more skewed than the others. Imputing these features with their means could bias the results by filling the missing values with larger or smaller values than desired. Therefore, these features will be imputed with their medians.
These features include:
Right skewed
Months Since Oldest Trade Open
Number Satisfactory Trades
Number of Trades Open in Last 12 Months
Net Fraction Revolving Burden
Number Revolving Trades with Balance
Left skewed
Percent Trades Never Delinquent
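Skewness can also be checked numerically rather than by eye: pandas reports the adjusted sample skewness, where a clearly positive value indicates a right tail and a mean above the median. A toy illustration (the columns here are synthetic, not HelocData features):

```python
import pandas as pd

toy = pd.DataFrame({
    # a few large values pull the mean above the median -> right skew
    'right_skewed': [1, 2, 2, 3, 3, 3, 4, 20, 25, 30],
    # perfectly symmetric around 5 -> zero skew
    'symmetric':    [1, 2, 3, 4, 5, 5, 6, 7, 8, 9],
})

skew = toy.skew()
# For the right-skewed column, mean > median, which is exactly why
# median imputation is preferred for such features.
```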
from sklearn.model_selection import train_test_split
random_seed = 4
df_train, df_test = train_test_split(df, test_size=0.2, random_state=random_seed)  # pass the seed itself; np.random.seed() returns None
df_train.shape
df_test.shape
from sklearn.impute import SimpleImputer
skewed_cols = ['Months Since Oldest Trade Open', 'Number Satisfactory Trades', 'Number of Trades Open in Last 12 Months',
'Net Fraction Revolving Burden', 'Number Revolving Trades with Balance']
# fit the imputer on the training split only, then apply it to both splits to avoid leakage
imp = SimpleImputer(missing_values=np.nan, strategy="median")
df_train[skewed_cols] = imp.fit_transform(df_train[skewed_cols])
df_test[skewed_cols] = imp.transform(df_test[skewed_cols])
df_train.head()
Judging from the histograms of x10 (Max Delq/Public Records Last 12 Months) and x11 (Max Delinquency Ever), it seems more reasonable to impute with the mode, because for these ordinal features the mode better represents the behavioral trends of the loans.
ordinal_cols = ['Max Delq/Public Records Last 12 Months', 'Max Delinquency Ever']
imp2 = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
df_train[ordinal_cols] = imp2.fit_transform(df_train[ordinal_cols])
df_test[ordinal_cols] = imp2.transform(df_test[ordinal_cols])  # reuse the train-fitted imputer
df_train.head()
For the other features, NaN values will be imputed with the mean.
columns = df_train.columns
# note that the special columns above are already imputed and are therefore unaffected here
sim_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df_train = pd.DataFrame(sim_imputer.fit_transform(df_train), columns=columns)
df_test = pd.DataFrame(sim_imputer.transform(df_test), columns=columns)
df_train.head()
The data cleaning and imputation process has been completed. Now, for better model performance and interpretability of the data, it is essential to understand more about the data.
X_train, y_train = df_train.iloc[:,1:], df_train['RiskFlag']
X_train.head()
y_train.head()
X_test, y_test = df_test.iloc[:,1:], df_test['RiskFlag']
X_test.head()
Boxplots will show the general distribution of each feature and its outliers.
cols = 4
rows = (df.shape[1]-1)//cols + 1
features = df_train.columns[1:]
fig = plt.figure(figsize=(15, 5*rows))
for i, var_name in enumerate(features):
    ax = fig.add_subplot(rows, cols, i+1)
    bp = df_train.boxplot(column=var_name, by='RiskFlag', ax=ax, return_type='dict')
    [value.set_color('r') for value in bp[0]['medians']]   # set median line color
    [value.set_linewidth(2) for value in bp[0]['medians']] # set median linewidth
    ax.set_xlabel('')
    ax.set_title(var_name, fontsize=10)
plt.suptitle('Box Plot Grouped by RiskFlag', fontsize=15)
plt.tight_layout(rect=[0, 0, 1, 0.97])  # remove extra space
plt.show()
Consolidated version of risk markers is clearly the most important factor deciding the label, since the two labels show completely different distributions of this feature.
Months Since Most Recent Trade Open does not seem to be an important factor in deciding good or bad risk, since it shows roughly similar distributions for both labels and has relatively many outliers.
To understand the data and select meaningful, independent features, it is important to study the correlations of the data more deeply. Below is the correlation heatmap of the data.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15,15))
corr = df_train.corr()
sns.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns, annot = True,linewidths=.5, ax=ax)
We can observe some meaningful findings from the heatmap.
(1) Consolidated version of risk markers is the feature most highly correlated with the response.
(2) Consolidated version of risk markers is positively correlated with RiskFlag (the response/dependent variable). Since 1 encodes 'Good', this implies that the higher the value of this feature is, the less risky the loan is.
(3) Other than Consolidated version of risk markers, the features do not seem to be highly correlated with the response.
(4) Meanwhile, Consolidated version of risk markers is relatively highly correlated with other features such as "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months", "Max Delinquency Ever", "Net Fraction Revolving Burden", "Number Bank/Natl Trades with high utilization ratio", and "Percent Trades with Balance". It can be inferred that the Consolidated version of risk markers is calculated from values that are correlated with these features.
(5) "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months", and "Max Delinquency Ever" all seem to be relatively highly correlated with each other.
(6) "Number of Inq Last 6 Months" and "Number of Inq Last 6 Months excl 7 days" are highly correlated, which intuitively makes sense.
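Observations like these can be double-checked numerically by ranking the absolute correlations of the features with the response, i.e. reading off the RiskFlag row of the correlation matrix. A self-contained sketch on synthetic data (the column names are placeholders, not HelocData features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
risk_marker = rng.normal(size=n)    # strongly drives the response
noise_feature = rng.normal(size=n)  # unrelated to the response
y = (risk_marker + 0.3 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({'RiskFlag': y,
                    'risk_marker': risk_marker,
                    'noise_feature': noise_feature})

# Absolute correlation with the response, largest first.
ranking = toy.corr()['RiskFlag'].drop('RiskFlag').abs().sort_values(ascending=False)
```

On the real data, `df_train.corr()['RiskFlag']` gives the same row that the heatmap visualizes.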
Based on the observations above, the consolidated version of risk markers is plausibly a credit score for each data instance (it is positively correlated with the percentage of trades never delinquent and negatively correlated with the utilization ratio and revolving burden).
It is important to note that this feature is relatively highly correlated with RiskFlag (the response) even though the features from which it is presumably calculated are much less correlated with the response. This shows that devising new features based on existing ones may help boost the accuracy of the model by providing a more correlated feature.
Considering that Consolidated version of risk markers behaves like a credit score and that the data is provided by FICO, the feature can be assumed to be the FICO score of the loan, calculated by the scorecard below.
source: FICO
https://www.fico.com/en/latest-thinking/fact-sheet/understanding-fico-scores
By assuming that the feature is calculated according to the scorecard above, together with the correlations in the data, it can be inferred that Consolidated version of risk markers is calculated from values related to past delinquency (negatively affecting the score: the longer the delinquency, the worse the loan), the utilization ratio, the number of months in file, the number of prior inquiries, and the number of trades.
Then, it can be said that the features above are already taken into account; engineering a new feature based on those variables would therefore be redundant, would increase the multicollinearity between the features, and would consequently decrease the interpretability of the model and the data.
Hence, it would be more meaningful to come up with a new feature calculated from features that are less correlated with those above. It is therefore reasonable to select features that are intuitively important in explaining the label but are not included in the calculation of the consolidated version of risk markers.
A feature satisfying these criteria is Number Satisfactory Trades. It can be assumed that a higher number of satisfactory trades positively affects the risk flag (good risk flag). However, consider one data instance (A) with over a thousand total trades but only 20 satisfactory trades, and another instance with fewer satisfactory trades but a much smaller number of total trades. If only the number of satisfactory trades were considered, A would be assigned a good risk flag, a result that is highly likely to be biased, since A's percentage of satisfactory trades can be significantly lower than the others'. Therefore, it seems more reasonable to consider the percentage of satisfactory trades rather than the total number of satisfactory trades.
X_train['Satisfactory Trades Percentage'] = X_train['Number Satisfactory Trades'] / X_train['Number of Total Trades']
X_test['Satisfactory Trades Percentage'] = X_test['Number Satisfactory Trades'] / X_test['Number of Total Trades']
#update features
features = X_train.columns
X_train['Satisfactory Trades Percentage']
There are some values bigger than 1, as well as infinities from division by zero, which are misleading. These values will be imputed with the mean.
X_train.loc[X_train['Satisfactory Trades Percentage'] > 1.0, 'Satisfactory Trades Percentage'] = np.nan
X_test.loc[X_test['Satisfactory Trades Percentage'] > 1.0, 'Satisfactory Trades Percentage'] = np.nan
X_train
# fit the imputer on the training split only, then apply it to both splits
new_feature_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_train[['Satisfactory Trades Percentage']] = new_feature_imputer.fit_transform(X_train[['Satisfactory Trades Percentage']])
X_test[['Satisfactory Trades Percentage']] = new_feature_imputer.transform(X_test[['Satisfactory Trades Percentage']])
X_train
Now, let's look at the distribution of the new feature to see whether any further transformation of the data is necessary.
X_train['Satisfactory Trades Percentage'].hist(bins=15, density=True, color='green', alpha=0.4, rwidth=0.95, figsize=(7,7))
X_train['Satisfactory Trades Percentage'].plot(kind='density', color='green')
It seems that most data instances have a high percentage of satisfactory trades. That is, unless a data instance has an extremely low value of this percentage, the newly introduced feature is expected to be of limited use in predicting the risk flag.
Roughly 10 features will be selected based on three standards:
(1) Model-free feature selection
(2) Model based feature selection
(3) Correlation and outliers
Univariate feature selection
Univariate feature selection selects features based on univariate statistical tests between each feature and the response. Specifically, 'SelectKBest' with the chi-squared test will be used.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
K_best_classifier = SelectKBest(chi2, k=10).fit(X_train, y_train)
mask = K_best_classifier.get_support(indices = True)
features_by_univariate = X_train.columns[mask]
X_train_univariate = pd.DataFrame(K_best_classifier.transform(X_train), columns = features_by_univariate)
X_train_univariate.head()
# pd.DataFrame(X_train_univariate, columns = X_train.columns[mask])
Logistic Lasso
LassoCV provides interpretable feature selection by showing how the coefficients of the features change as the constraint parameter varies. (Note that plain Lasso regression on the 0/1 labels is used below as a stand-in for an L1-penalized logistic regression.) It is also important to standardize the data before applying the Lasso.
#standardize the features
from sklearn.preprocessing import StandardScaler
# fit the scaler on the training split only, then apply the train statistics to both splits
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
X_train_scaled.head()
from sklearn import linear_model
cv = 10
lassocv = linear_model.LassoCV(cv=cv, max_iter=50000)  # max_iter must be an integer
lassocv.fit(X_train_scaled, y_train)
print('alpha = %.2e' %lassocv.alpha_)
eps = 5e-5 # the smaller it is the longer is the path
alphas_lasso, coefs_lasso, _ = linear_model.lasso_path(X_train_scaled, y_train, eps=eps)  # data is already standardized, so no intercept is needed
plt.figure(figsize = (20,15))
NUM_COLORS = 21
LINE_STYLES = ['solid', 'dashed', 'dashdot', 'dotted']
NUM_STYLES = len(LINE_STYLES)
cm = plt.get_cmap('gist_rainbow')
for i in range(X_train_scaled.shape[1]):
    lines = plt.plot(alphas_lasso, coefs_lasso[i], label=X_train_scaled.columns[i])
    lines[0].set_color(cm(i//NUM_STYLES*float(NUM_STYLES)/NUM_COLORS))
    lines[0].set_linestyle(LINE_STYLES[i%NUM_STYLES])
plt.xscale('log')
plt.xlabel('Log($\\lambda$)')
plt.ylabel('coefficients')
plt.title('Lasso paths')
plt.legend()
plt.axis('tight')
It can be seen that as the constraint parameter (lambda) decreases, the coefficients of different features start to take meaningful values.
A notable feature is "Consolidated version of risk markers", which is re-confirmed as the most important feature, as in the previous correlation analysis. Each time a new feature starts to have an effect as lambda changes, the path of "Consolidated version of risk markers" changes its trend. This can be explained as follows: when a feature highly correlated with it enters the model, its coefficient may change because the newly introduced variable also explains part of the response previously explained by "Consolidated version of risk markers". The graph shows the most drastic change at the moment "Number Bank/Natl Trades w high utilization ratio" and "Percent Trades Never Delinquent" are introduced. This can be explained by the high correlation between these two features and "Consolidated version of risk markers" (see the correlation heatmap above, where these two features are the most highly correlated with it).
Furthermore, there is a wide U-shaped curve around log(lambda) = 1e-3, generated by two dotted light blue lines: "Number of Inq Last 6 Months" and "Number of Inq Last 6 Months excl 7 days". Intuitively, these two are directly related, which can also be checked in the correlation heatmap. Therefore, only one of them, "Number of Inq Last 6 Months", will be selected.
Another interesting observation is that "Percent Trades with Balance" begins to take a nonzero value, moves back to 0 at the moment "Max Delinquency Ever" begins to have an effect, and after a while becomes nonzero again.
According to the analysis of the graph, features are selected in descending order of importance as follows:
features_by_lasso = ["Consolidated version of risk markers", "Net Fraction Revolving Burden", "Number Satisfactory Trades",
"Average Months in File", "Number of Inq Last 6 Months", "Number Bank/Natl Trades w high utilization ratio",
"Months Since Oldest Trade Open", "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months",
"Max Delinquency Ever"]
X_train_lasso = X_train[features_by_lasso]
X_test_lasso = X_test[features_by_lasso]
X_train_lasso.head()
Permutation Feature Importance
Permutation feature importance is evaluated by the variation in a metric score caused by permuting a feature's values. An XGBoost model will be used as the underlying classifier for the permutation object.
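Before handing this to eli5, the underlying idea can be sketched from scratch: shuffle one column at a time and record how much the score drops. The sketch below uses synthetic data and a logistic model as a stand-in for the XGBoost classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
y = (X[:, 0] > 0).astype(int)  # only column 0 carries signal
model = LogisticRegression(max_iter=1000).fit(X, y)
base = model.score(X, y)

# Permutation importance by hand: shuffle one column, measure the accuracy drop.
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(base - model.score(Xp, y))
# The informative column should show by far the largest drop.
```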
# fit a XGBoost
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05)
xgb_clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
y_pred = xgb_clf.predict(X_train)
print("training accuracy by xgboost:", accuracy_score(y_train, y_pred))
features
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance
# define a permutation importance object
perm = PermutationImportance(xgb_clf).fit(X_train, y_train)
# show the importance
eli5.show_weights(perm, feature_names= np.array(features) )
# importance in decreasing order
imp_ord = np.argsort(perm.feature_importances_)
plt.figure(figsize=(9,8))
yaxis = np.arange(len(perm.feature_importances_))*1.2
plt.barh(y = yaxis,width = perm.feature_importances_[imp_ord])
plt.yticks(yaxis,np.array(features)[imp_ord])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.show()
It can be observed that this feature ranking differs from those of the other two methods. The important features selected by permutation importance on the tree-based model are as follows:
features_by_tree = ["Consolidated version of risk markers", "Number Satisfactory Trades", "Average Months in File",
"Percent Trades Never Delinquent", "Net Fraction Revolving Burden", "Months Since Oldest Trade Open",
"Percent Installment Trades", "Percent Trades with Balance", "Number Revolving Trades with Balance",
"Max Delq/Public Records Last 12 Months"]
X_train_tree = X_train[features_by_tree]
X_test_tree = X_test[features_by_tree]
X_train_tree.head()
The goal of this project is to achieve high prediction accuracy and an interpretable model at the same time. Therefore, both white-box and black-box models will be used, with various interpretations provided according to the fitted models. For white-box models, visualization and model-diagnostic methods will be discussed. For black-box models, model-agnostic post hoc methods such as VI, PDP, ICE, and SHAP will be used.
White-box model includes: GAM
Black-box models include: Gradient Boosting, MLP
Post hoc methods:
Variable Importance: for the forest based fitted models, variable importance will be discussed for the global interpretation.
Partial Dependency Plot (PDP) / Individual Conditional Expectation (ICE): Considering that the features are not highly correlated, dependency plots will provide meaningful insights of features globally.
SHAP: For both global and local interpretation, SHAP will be leveraged to explain how the changes in feature affect the prediction. SHAP will be fully leveraged especially for Deep Neural Network model.
!pip install shap
import shap
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
LogisticGAM will be used as the main model, along with feature engineering techniques such as linear splines and B-splines of degree 3.
!pip install pygam
from pygam import LogisticGAM, f, s, l
logit_GAM_linear_spline = LogisticGAM(spline_order = 1)
logit_GAM_linear_spline.gridsearch(X_train_lasso.values, y_train)
logit_GAM_b_spline = LogisticGAM()
# gridsearch to tune the penalty strengths (lam) of the terms in the current GAM model
logit_GAM_b_spline.gridsearch(X_train_lasso.values, y_train)
print('GAM_linear training accuracy: ', accuracy_score(y_train, logit_GAM_linear_spline.predict(X_train_lasso)))
print('GAM_linear test accuracy: ', accuracy_score(y_test, logit_GAM_linear_spline.predict(X_test_lasso)))
print('GAM_B_spline training accuracy: ', accuracy_score(y_train, logit_GAM_b_spline.predict(X_train_lasso)))
print('GAM_B_spline test accuracy: ', accuracy_score(y_test, logit_GAM_b_spline.predict(X_test_lasso)))
The higher accuracy of the B-spline model may be due to non-linear effects of the features on the label. Therefore, it is worth discussing the B-spline model in more detail than the linear-spline one.
Model Summary
logit_GAM_b_spline.summary()
It can easily be seen that the first feature (Consolidated version of risk markers) has the lowest p-value. This implies it is the most significant feature of all, as expected. The relative importance of the features can be inferred from these p-values.
Furthermore, pseudo R-squared seems to be in good range that the model is well fitted (see https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation).
Partial Dependency Plot (PDP)
# partial dependence plot for each term in the fitted GAM
fig, axs = plt.subplots(2, 5, figsize=(20, 7))
for i, ax in enumerate(axs.flatten()):
    XX = logit_GAM_b_spline.generate_X_grid(term=i)
    plt.sca(ax)  # make this subplot current
    plt.plot(XX[:, i], logit_GAM_b_spline.partial_dependence(term=i, X=XX))
    plt.plot(XX[:, i], logit_GAM_b_spline.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    plt.title(X_train_lasso.columns[i])
plt.tight_layout()
The wider the red dashed band (the 95% confidence interval of a feature's partial effect) is, the wider the range in which we can only say the true effect lies. For example, Consolidated version of risk markers has a much narrower confidence interval than Max Delinquency Ever: at the 95% confidence level, its effect is pinned down to a narrower range of values. This may be because Max Delinquency Ever contains many more outliers than Consolidated version of risk markers (see the boxplots above), making it hard to confidently narrow down the range in which the true effect lies.
Gradient Boosting will be trained on the features selected by permutation importance with xgboost. For global interpretation, Variable Importance (VI) and Partial Dependency Plots (PDP) will be used to explain how overall feature variations affect the prediction. For local interpretation, SHAP will be used to explain how local feature changes contribute to the prediction.
Attempt on fitting gradient boosting with default setting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
from matplotlib.legend_handler import HandlerLine2D
gradient_boosting_classifier = GradientBoostingClassifier()
gradient_boosting_classifier.fit(X_train_tree, y_train)
print('gradient boosting training accuracy: ', accuracy_score(y_train, gradient_boosting_classifier.predict(X_train_tree)))
y_pred = gradient_boosting_classifier.predict(X_test_tree)
print('gradient boosting testing accuracy: ', accuracy_score(y_test, y_pred))
Model Evaluation
Since the task is binary classification, AUC is a good metric for evaluating the model. By fitting the model with different parameter values, the change in AUC will be used to decide a reasonable range for each parameter. After narrowing down rough ranges, the model will be tuned further with sequential grid searches to find the best combination of these parameters.
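One caveat: the code below feeds hard `predict()` labels into `roc_curve`. AUC measures ranking quality, so scoring predicted probabilities is generally preferred; thresholding can understate it. A tiny made-up illustration (the numbers are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.2, 0.3, 0.4, 0.8])  # hypothetical predict_proba(...)[:, 1]
hard = (proba >= 0.5).astype(int)       # what predict() would return

auc_proba = roc_auc_score(y_true, proba)  # ranking is perfect: 1.0
auc_hard = roc_auc_score(y_true, hard)    # thresholding loses information: 0.75
```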
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)
plot_roc_curve(false_positive_rate, true_positive_rate)
Hyperparameter tuning
Plotting learning rate and min samples leaf curves to find the overfitting limit.
hyperparamsdict = {
'learning_rate': [1, 0.5, 0.3, 0.25, 0.1, 0.05, 0.01],
'min_samples_leaf': range(20,71,10),
}
for key in hyperparamsdict:
    train_results = []
    test_results = []
    for value in hyperparamsdict.get(key):
        temp_model = GradientBoostingClassifier(**{key: value})
        temp_model.fit(X_train_tree, y_train)
        y_train_pred = temp_model.predict(X_train_tree)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, y_train_pred)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        train_results.append(roc_auc)
        y_pred = temp_model.predict(X_test_tree)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        test_results.append(roc_auc)
    line1, = plt.plot(hyperparamsdict.get(key), train_results, 'b', label='Train AUC')
    line2, = plt.plot(hyperparamsdict.get(key), test_results, 'r', label='Test AUC')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel('AUC score')
    plt.xlabel(key)
    plt.show()
Set the initial learning rate and find the optimal number of trees.
The learning rate was initially set to 0.25 according to the learning-rate graph above. However, the optimal number of trees then turned out to be 20, the lower limit of the search range, so the learning rate was lowered to 0.05.
gb = GradientBoostingClassifier(learning_rate=0.05, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=3612)
parameter_space = {
'n_estimators': range(20,81,10)
}
gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)
# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
The optimal n_estimators was found to be 40, an acceptable number.
Next, max depth and min samples split are tested.
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, min_samples_leaf=50,max_features='sqrt',subsample=0.8,random_state=3612)
parameter_space = {
'max_depth':range(3,20,2),
'min_samples_split':range(100,551,50)
}
gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)
# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
Max depth and min samples split were found to be 13 and 350, respectively.
The min samples split is logical given the data size, whereas a max depth of 13 may be a bit large given there are only 10 features.
A high max depth was found to lead to overfitting, though it is not shown at this stage.
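That overfitting effect can be demonstrated on synthetic data by comparing the train/test accuracy gap of a shallow and a deep model (a toy illustration, not HelocData; the depths 2 and 13 and the noise level are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3612)
n = 600
X = rng.normal(size=(n, 5))
# Weak signal plus heavy noise, so deep trees mostly memorize the noise.
y = (X[:, 0] + 2.0 * rng.normal(size=n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gaps = {}
for depth in (2, 13):
    m = GradientBoostingClassifier(max_depth=depth, n_estimators=40,
                                   random_state=0).fit(X_tr, y_tr)
    gaps[depth] = m.score(X_tr, y_tr) - m.score(X_te, y_te)
# The deep model is expected to show the larger train/test gap.
```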
Next, min samples leaf is tested, while allowing some variation in min samples split around the result above.
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13,max_features='sqrt',subsample=0.8,random_state=3612)
parameter_space = {
'min_samples_split': range(250,451,50),
'min_samples_leaf': range(30,71,10),
}
gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)
# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
The optimal min samples split stays the same at 350, while min samples leaf was found to be 60.
Next, max_features is tested.
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13, min_samples_split=350, min_samples_leaf=60,subsample=0.8,random_state=3612)
parameter_space = {
'max_features':range(1,10,1)
}
gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)
# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
max_features was found to be 3, which agrees with the common default of the square root of the number of features (sqrt(10) ≈ 3.2).
Finally, the subsample size is tested.
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, random_state=3612)
parameter_space = {
'subsample': [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
}
gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)
# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
The optimal subsample fraction is found to be 0.8, again a common default when tuning gradient boosting.
To determine the final learning rate and number of trees, we repeatedly halve learning_rate while increasing n_estimators proportionally.
gb_clf = GradientBoostingClassifier(learning_rate=0.01, n_estimators=200, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)
gb_clf.fit(X_train_tree,y_train)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
gb_clf = GradientBoostingClassifier(learning_rate=0.005, n_estimators=400, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)
gb_clf.fit(X_train_tree,y_train)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
gb_clf = GradientBoostingClassifier(learning_rate=0.0025, n_estimators=800, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)
gb_clf.fit(X_train_tree,y_train)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
Accuracy dropped at this point, so the model with learning_rate=0.005 and n_estimators=400 is chosen as the final model.
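The three runs above follow a mechanical schedule: halve learning_rate, double n_estimators. The same search can be written as a loop; a sketch on synthetic data (in the notebook, the real `X_train_tree` / `y_train` would be substituted):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the HELOC features
X, y = make_classification(n_samples=1000, n_features=10, random_state=3612)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3612)

results = {}
lr, n_est = 0.01, 200
for _ in range(3):
    clf = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n_est, max_depth=13,
        min_samples_split=350, min_samples_leaf=60,
        max_features=3, subsample=0.8, random_state=3612)
    clf.fit(X_tr, y_tr)
    results[(lr, n_est)] = accuracy_score(y_te, clf.predict(X_te))
    lr, n_est = lr / 2, n_est * 2   # halve the rate, double the trees

for (lr_, n_), acc in results.items():
    print(f"lr={lr_}, n_estimators={n_}: {acc:.4f}")
```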
Final gradient boosting model:
gradient_boosting_classifier = GradientBoostingClassifier(learning_rate=0.005, n_estimators=400, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)
gradient_boosting_classifier.fit(X_train_tree,y_train)
print('Training Accuracy:', accuracy_score(y_train, gradient_boosting_classifier.predict(X_train_tree)).round(6))
y_pred = gradient_boosting_classifier.predict(X_test_tree)
print('Test Accuracy:', accuracy_score(y_test, y_pred).round(6))
Both the AUC score and the test accuracy improved by roughly 0.003, but both remain relatively low.
Test set accuracy is 0.7089.
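One caveat: `roc_curve` above is fed hard 0/1 predictions, which collapses the ROC to a single operating point; feeding predicted probabilities traces the full curve and usually gives a more faithful AUC. A sketch of the difference on synthetic data (`make_classification` stands in for the HELOC features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: only one threshold is represented
auc_hard = roc_auc_score(y_te, clf.predict(X_te))
# AUC from probability scores: every threshold contributes
auc_prob = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_hard, auc_prob)
```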
Model Interpretation
Variable Importance (VI)
importances = np.array(gradient_boosting_classifier.feature_importances_)
arg_sorted = np.argsort(importances)
yaxis = np.arange(len(importances))
plt.barh(y = yaxis, width =importances[arg_sorted])
plt.yticks(yaxis, X_train_tree.columns[arg_sorted])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.show()
The output does not deviate much from the permutation importances of the single tree (see above); only the order of Average Months in File and Number Satisfactory Trades is switched. This may be due to correlation between these two features and the features that were dropped when only 10 features were selected.
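For comparison with the impurity-based importances plotted above, `sklearn.inspection.permutation_importance` gives a model-agnostic estimate (shuffle one column at a time and measure the score drop). A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# shuffle each column 10 times and record the mean accuracy drop
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}"
          f" +/- {result.importances_std[i]:.4f}")
```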
Partial Dependency Plot (PDP)
# plot_partial_dependence now lives in sklearn.inspection
# (scikit-learn >= 1.2 replaces it with PartialDependenceDisplay.from_estimator)
from sklearn.inspection import plot_partial_dependence, partial_dependence
from mpl_toolkits.mplot3d import Axes3D
# suppress warnings from the deprecated partial dependence plotting API
import warnings
warnings.filterwarnings('ignore')
disp = plot_partial_dependence(gradient_boosting_classifier,
                               X_train_tree,
                               features=X_train_tree.columns,
                               feature_names=features_by_tree)
fig = disp.figure_
fig.set_figwidth(9.5)
fig.set_figheight(9.5)
fig.tight_layout()
Among all features, variations in Consolidated version of risk markers and Number Satisfactory Trades seem to affect the odds of the risk flag the most. This reconfirms that Consolidated version of risk markers is one of the most (if not the most) influential factors for prediction. It is also worth noting that an extremely high Number Satisfactory Trades strongly affects the odds of the risk flag, while an extremely low value tells us little about the flag.
Percent Trades Never Delinquent has an interesting plot. Values below 55 do not affect the prediction, while values above 60 seem positively correlated with it. This may imply that if the percentage of never-delinquent trades is below 55%, the instance is likely to carry the bad risk flag regardless of how low the percentage is.
Interaction Plots
After examining several interaction plots, Average Months in File and Number Revolving Trades with Balance (plotted below) appear to have a meaningful interaction.
fig = plt.figure(figsize=(10,7))
features_plot = ("Average Months in File", "Number Revolving Trades with Balance")
pd_result = partial_dependence(gradient_boosting_classifier, X_train_tree, [2, 8])
# scikit-learn >= 0.24 returns a Bunch; older versions return a (pdp, axes) tuple
if hasattr(pd_result, 'average'):
    pdp = pd_result['average']
    axes = pd_result.get('grid_values', pd_result.get('values'))
else:
    pdp, axes = pd_result
XX, YY = np.meshgrid(axes[0], axes[1])
Z = pdp[0].T
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer auto-registers in newer matplotlib
surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1,
cmap=plt.cm.BuPu, edgecolor='k')
ax.set_xlabel(features_plot[0])
ax.set_ylabel(features_plot[1])
ax.set_zlabel('Partial dependence')
ax.view_init(elev=22, azim=122)
plt.colorbar(surf)
plt.suptitle('Partial dependence of Risk Flag \n'
'with Gradient Boosting')
plt.subplots_adjust(top=0.9)
plt.show()
This graph clearly shows an interaction between seemingly uncorrelated features. When Number Revolving Trades with Balance is smaller than 20, higher values of Average Months in File explain the odds of the dependence more than lower values do. However, once Number Revolving Trades with Balance exceeds 20, the trend for Average Months in File reverses drastically: values smaller than 50 contribute much more to the odds of the dependence than higher values.
SHAP
import shap  # if not already imported above
shap.initjs()
# define the explainer
tree_explainer = shap.TreeExplainer(gradient_boosting_classifier)
# calculate the SHAP values
shap_values = tree_explainer.shap_values(X_train_tree)
shap.initjs()
shap.force_plot(tree_explainer.expected_value, shap_values, X_train_tree, feature_names=X_train_tree.columns)
from tensorflow import keras
from tensorflow.keras import layers
# import from tensorflow.keras consistently rather than mixing in standalone keras;
# the loss and activations below are passed by their string names, so these
# imports are not strictly required
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.activations import relu, elu, sigmoid
#scale the features selected by lasso
scaler2 = StandardScaler()
X_train_lasso_scaled = pd.DataFrame(scaler2.fit_transform(X_train_lasso))
X_train_lasso_scaled.columns = features_by_lasso
X_train_lasso_scaled.head()
# get the subset of test X selected by lasso
X_test_lasso = X_test[features_by_lasso]
# reuse the scaler fitted on the training data; fitting a fresh scaler on the
# test set would leak test statistics into preprocessing
X_test_lasso_scaled = pd.DataFrame(scaler2.transform(X_test_lasso))
X_test_lasso_scaled.columns = features_by_lasso
X_test_lasso_scaled.head()
multi_layer_classifier = keras.Sequential()
multi_layer_classifier.add(layers.Dense(10, input_dim =10, activation = 'relu'))
multi_layer_classifier.add(layers.Dense(50, activation = 'relu'))
multi_layer_classifier.add(layers.Dense(1, activation = 'sigmoid'))
multi_layer_classifier.compile(optimizer=keras.optimizers.Adam(),
loss='binary_crossentropy',
metrics=['accuracy'])
multi_layer_classifier.fit(X_train_lasso_scaled, y_train, epochs=150, batch_size=10,)
MLP_training_score = multi_layer_classifier.evaluate(X_train_lasso_scaled, y_train, batch_size=10)[1]
MLP_test_score = multi_layer_classifier.evaluate(X_test_lasso_scaled, y_test, batch_size=10)[1]
print('MLP training accuracy: ', MLP_training_score)
print('MLP test accuracy: ', MLP_test_score)
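As a sanity check on the architecture above (input dimension 10, then Dense layers of 10, 50, and 1 units), the trainable-parameter count can be verified by hand; each dense layer contributes `inputs * units` weights plus `units` biases:

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

layer_sizes = [10, 10, 50, 1]  # input dim, then the three Dense layers
total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 711, which should match multi_layer_classifier.count_params()
```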
!git clone https://github.com/keras-team/keras-tuner.git
!pip install ./keras-tuner
from kerastuner.tuners import RandomSearch
Creating a validation data set.
X_sequential_train, X_sequential_val, y_sequential_train, y_sequential_val = train_test_split(X_train_lasso_scaled, y_train, test_size=0.2, random_state=10)
def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Dense(10, input_dim=10, activation='relu'))
    model.add(layers.Dense(50, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    directory='keras_tuner',
    project_name='sequential')
# search on the training split only, keeping the validation split held out
tuner.search(X_sequential_train.values, y_sequential_train.values,
             epochs=5,
             validation_data=(X_sequential_val.values, y_sequential_val.values))
tuner.results_summary()
multi_layer_classifier = tuner.get_best_models(num_models=1)[0]
multi_layer_classifier.fit(X_train_lasso_scaled.values, y_train.values)
print('Training accuracy:', accuracy_score(y_train.values, multi_layer_classifier.predict(X_train_lasso_scaled.values).round(0)).round(6))
y_pred = multi_layer_classifier.predict(X_test_lasso_scaled.values)
print('Test accuracy:',accuracy_score(y_test.values, y_pred.round(0)).round(6))
Hyperparameter tuning of the MLP requires deeper fundamental knowledge of deep neural networks.
The numbers of layers and nodes here are arbitrary, while the other parameters follow values from published examples.
The models built are not tailored to the HELOC data set, so higher accuracy is not expected.
Because of its limited explainability, the deep neural network will be interpreted both globally and locally, relying heavily on SHAP values.
deep_explainer = shap.DeepExplainer(multi_layer_classifier, X_train_lasso_scaled.values)  # pass a NumPy array as background data
# calculate the SHAP values
shap_values = deep_explainer.shap_values(np.array(X_test_lasso_scaled))
Let's take a look at the global interpretation of the fitted deep neural network.
shap.summary_plot(shap_values[0], X_test_lasso_scaled)
Variable importances are listed from top to bottom (Consolidated version of risk markers explains the output the most).
The output seems intuitively plausible.
Consolidated version of risk markers, Average Months in File, Max Delq Records Last 12 Months, and Number Satisfactory Trades push the SHAP value negatively when they take low values. This makes sense, since low values of these features tend to indicate high risk, pushing the output down.
Conversely, Net Fraction Revolving Burden and the utilization ratio contribute negatively when they take high values, since a higher revolving burden or utilization ratio intuitively indicates higher risk.
# SHAP dependence plots for all the features selected by lasso
for feature in X_train_lasso_scaled.columns:
    shap.dependence_plot(feature, shap_values[0], X_test_lasso_scaled, feature_names=X_test_lasso_scaled.columns)
    plt.show()
The most obvious dependence is found between Consolidated version of risk markers and the utilization ratio: low utilization ratios tend to have negative SHAP values where Consolidated version of risk markers also has low SHAP values. This can be explained by the fact that a low utilization ratio means low risk, hence a high credit score, positively affecting the risk flag (good flag).
Now, let's interpret the model on a local level. How does each feature value affect the prediction for a single data instance? Below is how the 100th instance is affected by each feature; each contribution to the prediction is calculated via SHAP.
shap.initjs()
shap.force_plot(deep_explainer.expected_value, shap_values[0][100,:], X_test_lasso_scaled.iloc[100,:], feature_names=X_test_lasso_scaled.columns)
It can be seen that a positive Consolidated version of risk markers pushes the SHAP value up, a negative Net Fraction Revolving Burden tends to push it up, and a positive Percent Trades Never Delinquent also pushes the output SHAP value up.
Below, the force plots for every data instance are aggregated into a single plot.
shap.initjs()
shap.force_plot(deep_explainer.expected_value, shap_values[0], X_test_lasso_scaled, feature_names=X_test_lasso_scaled.columns)
Logistic GAM with B-splines reported 72.01% training accuracy and 71.89% test accuracy.
Gradient boosting reported 73.78% training accuracy and 70.89% test accuracy.
MLP reported 72.62% training accuracy and 70.79% test accuracy.
All models had similar accuracy (within about 1%). Hyperparameter tuning did not improve accuracy significantly, which leads us to believe that better feature selection would matter more than further hyperparameter tuning; further work on feature selection is recommended.
Finally, the logistic GAM with B-splines is recommended. It has the highest test accuracy and the smallest gap between training and test accuracy. In addition, logistic GAM is a white-box model, and therefore more interpretable than the other two black-box models we used.
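As a minimal sketch, the reported scores can be tabulated to make the train-test gaps explicit (numbers copied from the summary above):

```python
import pandas as pd

# accuracies reported above, as fractions
scores = pd.DataFrame(
    {"train_acc": [0.7201, 0.7378, 0.7262],
     "test_acc": [0.7189, 0.7089, 0.7079]},
    index=["Logistic GAM (B-splines)", "Gradient Boosting", "MLP"])
scores["gap"] = (scores["train_acc"] - scores["test_acc"]).round(4)
print(scores)
```

The GAM row shows both the highest test accuracy and the smallest gap, which is the basis of the recommendation.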